Chapter 4 Corpus Analysis: A Start

In this chapter, I will demonstrate how to do basic corpus analysis once you have collected your data. I will show you some of the most common ways of working with text data.

4.1 Installing quanteda

There are many R packages for computational text analytics. You may consult the CRAN Task View: Natural Language Processing for many more alternatives.

To start with, this tutorial will use a powerful package, quanteda, for managing and analyzing textual data in R. You may refer to the official documentation of the package for more detail.

quanteda is not included in the default R installation. Please install the package if you haven’t done so.

install.packages("quanteda")
install.packages("readtext")

Also, as noted in the quanteda documentation, because this library compiles some C++ and Fortran source code, you will need to have the appropriate compilers installed.

  • If you are using a Windows platform, this means you will also need to install the Rtools software available from CRAN.
  • If you are using macOS, you should install the macOS tools.

If you run into any installation errors, please go to the official documentation page for additional assistance.

library(quanteda)
library(readtext)
library(tidytext)
library(dplyr)

packageVersion("quanteda")
## [1] '2.0.1'

4.2 Building a corpus from character vector

To demonstrate a typical corpus analytic example with texts, I will be using a pre-loaded corpus that comes with the quanteda package, data_corpus_inaugural. This is a corpus of US presidential inaugural address texts, and metadata for the corpus from 1789 to present.

data_corpus_inaugural
## Corpus consisting of 58 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
## 
## [ reached max_ndoc ... 52 more documents ]
class(data_corpus_inaugural)
## [1] "corpus"    "character"

We create a corpus object with corpus(), using the pre-loaded data_corpus_inaugural from quanteda:

corp_us <- corpus(data_corpus_inaugural) # save the `corpus` to a short obj name
summary(corp_us)

After the corpus is loaded, we can use summary() to get the metadata of each text in the corpus, including the number of word types and tokens. This gives us a quick look at the size of the addresses made by all presidents.

require(ggplot2)

corp_us %>%
  summary %>%
  ggplot(aes(x = Year, y = Tokens, group = 1)) +
    geom_line() +
    geom_point() +
    theme_bw()


Exercise 4.1 Could you reproduce the above line plot and add information of President to the plot as labels of the dots?

Hints: Please check ggplot2::geom_text() or the more advanced ggrepel::geom_text_repel().


4.3 Keyword-in-Context (KWIC)

Keyword-in-Context (KWIC), or concordancing, is the most frequently used method in corpus linguistics. The idea is very intuitive: we get to know more about the semantics of a word by examining how it is used in a wider context.

We can use kwic() to perform a search for a word and retrieve its concordances from the corpus:

kwic(corp_us, "terror")

kwic() returns a data frame, which can be easily output to a CSV file for later use.
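
As a minimal sketch of this export step, assuming quanteda is installed: the object names (toks_us, kw_terror, out_file) and the temporary output location are illustrative choices, not part of the tutorial. The corpus is tokenized explicitly with tokens() before the search, which recent quanteda releases prefer over passing a corpus directly.

```r
library(quanteda)

# Tokenize first, then search; object names here are illustrative
toks_us <- tokens(data_corpus_inaugural)
kw_terror <- kwic(toks_us, pattern = "terror")

# A kwic object behaves like a data frame, so it can be written out directly
out_file <- file.path(tempdir(), "kwic_terror.csv") # hypothetical output path
write.csv(as.data.frame(kw_terror), out_file, row.names = FALSE)

# Read the file back to confirm the concordance lines survived the round trip
kw_back <- read.csv(out_file, stringsAsFactors = FALSE)
head(kw_back[, c("docname", "pre", "keyword", "post")])
```

Writing to tempdir() keeps the sketch side-effect free; in practice you would replace out_file with a path inside your project.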

Please note that kwic(), when taking a corpus object as the argument, will automatically tokenize the corpus data and do the keyword-in-context search on a word basis. In other words, the pattern you look for cannot be a linguistic pattern across several words. We will talk about how to extract constructions later. Also, for languages without explicit word boundaries (e.g., Chinese), this may be a problem with quanteda. We will talk more about this in the later chapter on Chinese Texts Analytics.

4.4 KWIC with Regular Expressions

For more complex searches, we can use regular expressions as well in kwic(). For example, if you want to include terror and all its other related word forms, such as terrorist, terrorism, terrors, you can do a regular expression search.

kwic(corp_us, "terror.*", valuetype = "regex")

By default, kwic() is word-based. If you would like to look up a multiword combination, use phrase():

kwic(corp_us, phrase("our country"))

It should be noted that the output of kwic includes not only the concordances (i.e., preceding/subsequent co-texts + the keyword), but also the sources of the texts for each concordance line. This would be extremely convenient if you need to refer back to the original discourse context of the concordance line.


Exercise 4.2 Please create a bar plot, showing the number of uses of the word country in each president’s address. Please include different variants of the word, e.g., countries, Countries, Country, in your kwic() search.


4.5 Tidy Text Format of the Corpus

So far our corpus is a corpus object defined in quanteda. Most of the widely used R packages follow tidy data principles to make data handling easier and more effective. As described by Hadley Wickham (Wickham and Grolemund 2017), tidy data has a specific structure:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

With text data like a corpus, we can also define the tidy text format as being a data.frame with one-token-per-row. A token is a meaningful unit of text, such as a word that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

In computational text analytics, the token (i.e., each row in the data frame) is most often a single word, but can also be an n-gram, sentence, or paragraph. The tidytext package in R is made for the handling of the tidy text format of the corpus data.

Tidy datasets allow manipulation with a standard set of tidy tools, including popular packages such as dplyr, tidyr, and ggplot2.

Figure 4.1: Computational Text Processing Flowchart

The tidytext package includes functions to tidy() objects from quanteda.

library(tidytext)
corp_us_tidy <- tidy(corp_us) # convert `corpus` to `data.frame`
class(corp_us_tidy)
## [1] "tbl_df"     "tbl"        "data.frame"

4.6 Frequency Lists

4.6.1 Word (Unigram)

Word tokenization is an important first step in getting a frequency list, because words are a meaningful linguistic unit in language. Word frequency lists also often reveal a great deal about what a corpus is about.

The tidytext package provides a powerful function, unnest_tokens(), to tokenize a data frame with larger linguistic units (e.g., texts) into one with smaller units (e.g., words). That is, unnest_tokens() converts a text-based data frame (each row is a text document) into a token-based data frame (each row is a token split from the text).

corp_us_words <- corp_us_tidy %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "words") # tokenize the `text` column into `word`

corp_us_words

unnest_tokens() is optimized for English tokenization into several linguistic units, such as words, ngrams, sentences, lines, and paragraphs (check ?unnest_tokens). To handle Chinese data, however, we need to define our own tokenization method via unnest_tokens(…, token = …). We will discuss the principles for Chinese text processing in a later chapter.

Please note that by default, token = "words" normalizes the texts to lower case and removes punctuation tokens. If you would like to preserve the casing differences and the punctuation, you can include the following arguments: unnest_tokens(…, token = "words", to_lower = FALSE, strip_punct = FALSE).
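
The effect of these defaults is easiest to see on a tiny input. The following is a minimal sketch on a made-up one-line text (the toy data frame and its column names are assumptions for illustration); it relies on the to_lower argument of unnest_tokens() and on strip_punct being passed through to the underlying word tokenizer.

```r
library(dplyr)
library(tidytext)

# A one-document toy data frame (hypothetical text, not from the corpus)
toy <- tibble(doc_id = "toy1", text = "The People, the people!")

# Default behaviour: lower-cased, punctuation dropped
toy_default <- toy %>%
  unnest_tokens(output = word, input = text, token = "words")

# Keep the original casing and keep punctuation marks as tokens
toy_raw <- toy %>%
  unnest_tokens(output = word, input = text, token = "words",
                to_lower = FALSE, strip_punct = FALSE)

toy_default$word # "the" "people" "the" "people"
toy_raw$word     # casing preserved; "," and "!" appear as separate tokens
```

With the defaults, "The" and "the" collapse into one word type, which is usually what we want for frequency lists but not for studies of capitalization or punctuation.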

Now we can count the word frequencies:

corp_us_words_freq <- corp_us_words %>% 
  count(word, sort=TRUE)

corp_us_words_freq

4.6.2 Bigrams

Frequency lists can be generated for bigrams or any other multiword combinations as well:

corp_us_bigrams <- corp_us_tidy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

corp_us_bigrams

To create a bigram frequency list:

corp_us_bigrams_freq <- corp_us_bigrams %>% 
  count(bigram, sort=TRUE)
corp_us_bigrams_freq
sum(corp_us_words_freq$n) # size of unigrams
## [1] 135562
sum(corp_us_bigrams_freq$n) # size of bigrams
## [1] 135504
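
The two totals differ slightly, and the gap can be checked by hand. A document with k word tokens yields k - 1 bigrams, because bigrams do not cross document boundaries, so the bigram total should fall short of the unigram total by exactly one per document. A quick sketch using the figures printed above (the variable names are illustrative):

```r
n_unigrams <- 135562 # total word tokens reported above
n_bigrams  <- 135504 # total bigram tokens reported above
n_docs     <- 58     # documents in data_corpus_inaugural, as printed earlier

# Each document loses exactly one bigram relative to its word count
n_unigrams - n_bigrams
n_unigrams - n_bigrams == n_docs
```

The same reasoning predicts that the trigram total is smaller than the unigram total by two per document.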

Exercise 4.3 The function unnest_tokens() does a lot of work behind the scenes. Please take a closer look at the outputs of unnest_tokens() and examine how it handles case normalization and punctuation within the sentence. Will these treatments affect the frequency lists we get in any important way? Please elaborate.

4.6.3 Ngrams (Lexical Bundles)

corp_us_trigrams <-  corp_us_tidy %>%
  unnest_tokens(trigrams, text, token = "ngrams", n = 3)

corp_us_trigrams

We then can examine which n-grams were most often used by each President:

corp_us_trigrams %>%
  count(President, trigrams) %>%
  group_by(President) %>%
  top_n(3, n) %>%
  arrange(President, desc(n))

Exercise 4.4 Please subset the top 3 trigrams of Presidents Donald Trump, Bill Clinton, and John Adams from corp_us_trigrams.


4.6.4 Frequency and Dispersion

When looking at frequency lists, there is another distributional metric we need to consider: dispersion. An n-gram can be meaningful if its frequency is high. However, a high frequency can arise in different ways. What if the n-gram occurs in only ONE particular document, i.e., is used only by a particular President? Or alternatively, what if the n-gram appears in many different documents, i.e., is used by many different Presidents?

The degree of dispersion of an n-gram has a lot to do with the significance of its frequency.

So now let’s compute the dispersion of the n-grams in our corp_us_trigrams. Here we define the dispersion of an n-gram as the number of Presidents who have used the n-gram at least once in his address(es).

# method 1: count each (trigram, President) pair first;
# the number of rows per trigram is then the number of Presidents using it
corp_us_trigrams %>%
  count(trigrams, President) %>%
  group_by(trigrams) %>%
  summarize(FREQ = sum(n), DISPERSION = n()) %>%
  filter(DISPERSION >= 5) %>%
  arrange(desc(DISPERSION))

# method 2: count tokens and distinct Presidents directly
corp_us_trigrams %>%
  group_by(trigrams) %>%
  summarize(FREQ = n(), DISPERSION = n_distinct(President)) %>%
  filter(DISPERSION >= 5) %>%
  arrange(desc(DISPERSION))

# Arrange according to frequency
# corp_us_trigrams %>%
#   count(trigrams, President) %>%
#   group_by(trigrams) %>%
#   summarize(freq = sum(n), dispersion = n()) %>%
#   arrange(desc(freq))
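
To see why the two methods agree, it may help to run them on a toy table small enough to check by eye. This is only a sketch: the data frame toy_trigrams and its contents are invented for illustration and use the same column names as corp_us_trigrams.

```r
library(dplyr)

# Hypothetical mini data set: one row per trigram token
toy_trigrams <- tibble(
  President = c("A", "A", "B", "C", "C", "C"),
  trigrams  = c("of the people", "of the people", "of the people",
                "we the people", "we the people", "of the people")
)

# Method 1: one row per (trigram, President) pair, then count the rows
m1 <- toy_trigrams %>%
  count(trigrams, President) %>%
  group_by(trigrams) %>%
  summarize(FREQ = sum(n), DISPERSION = n())

# Method 2: count tokens and distinct Presidents in a single step
m2 <- toy_trigrams %>%
  group_by(trigrams) %>%
  summarize(FREQ = n(), DISPERSION = n_distinct(President))

m1
all.equal(as.data.frame(m1), as.data.frame(m2))
```

Here "of the people" occurs 4 times across 3 Presidents, while "we the people" occurs twice but for a single President, so the two trigrams differ far more in dispersion than in frequency.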

In particular, cut-off values are often determined to select a list of meaningful n-grams. These cut-off values include: the frequency of the n-grams, as well as the dispersion of the n-grams. A subset of n-grams that are defined and selected based on these distributional criteria (i.e., frequency and dispersion) are often referred to as Lexical bundles.


Exercise 4.5 Please create a list of four-gram lexical bundles that have been used in at least FIVE different presidential addresses. Arrange the resulting data frame according to the frequency of the four-grams.

4.7 Word Cloud

With frequency data, we can visualize important words in the corpus with a Word Cloud. It is a novel but intuitive visual representation of text data. It allows us to quickly perceive the most prominent words from a large collection of texts.

library(wordcloud)
set.seed(123)
with(corp_us_words_freq, wordcloud(word, n, 
                                   max.words = 400,
                                   min.freq = 20,
                                   scale = c(2,0.5),
                                   colors = brewer.pal(8, "Dark2"),
                                   vfont=c("serif","plain")))


Exercise 4.6 Word cloud would be more informative if we first remove functional words. In tidytext, there is a preloaded data frame, stop_words, which contains common English stop words. Please make use of this data frame and try to re-create a word cloud with all stopwords removed. (Criteria: Frequency >= 20; Max Number of Words Plotted = 400)

Hint: Check dplyr::anti_join()

library(tidytext)
stop_words

Exercise 4.7 Get yourself familiar with another R package for creating word clouds, wordcloud2, and re-create a word cloud as requested in Exercise 4.6 but in a fancier format, i.e., a star-shaped one. (Criteria: Frequency >= 20; Max Number of Words Plotted = 400)

4.8 Collocations

With unigram and bigram frequencies of the corpus, we can further examine the collocations within the corpus. Collocation refers to a frequent phenomenon where two words tend to co-occur very often in use. This co-occurrence is defined statistically by their lexical associations.

4.8.1 Cooccurrence Table and Observed Frequencies

Cooccurrence frequency data for a word pair, w1 and w2, are often organized in a contingency table extracted from a corpus, as shown in Figure 4.2. The cell counts of this contingency table are called the observed frequencies O11, O12, O21, and O22.

Figure 4.2: Cooccurrence Frequency Table

The sum of all four observed frequencies (called the sample size N) is equal to the total number of bigrams extracted from the corpus. R1 and R2 are the row totals of the observed contingency table, while C1 and C2 are the corresponding column totals. The row and column totals are also called marginal frequencies, being written in the margins of the table, and O11 is called the joint frequency.
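
In symbols, the table and its margins can be sketched as follows, assuming the conventional arrangement in which rows indicate whether the first word is w1 and columns indicate whether the second word is w2 (the \neg notation for "any other word" is shorthand introduced here):

```latex
\[
\begin{array}{l|cc|c}
              & w_2    & \neg w_2 & \text{total} \\ \hline
w_1           & O_{11} & O_{12}   & R_1 \\
\neg w_1      & O_{21} & O_{22}   & R_2 \\ \hline
\text{total}  & C_1    & C_2      & N
\end{array}
\qquad
R_i = O_{i1} + O_{i2}, \quad
C_j = O_{1j} + O_{2j}, \quad
N = O_{11} + O_{12} + O_{21} + O_{22}
\]
```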

4.8.2 Expected Frequencies

Equations for all association measures are given in terms of the observed frequencies, marginal frequencies, and the expected frequencies E11, …, E22. Expected frequencies refer to the expected number of co-occurrences under the null hypothesis that w1 and w2 are statistically independent. The expected frequencies can easily be computed from the marginal frequencies as shown in Figure 4.3.

Figure 4.3: Computing Expected Frequencies
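
In formula form, and assuming the standard independence model, the expected frequency of each cell is the product of its row total and its column total divided by the sample size. For the joint cell, this follows from P(w1, w2) = P(w1) P(w2) under the null hypothesis:

```latex
\[
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{11} = N \cdot \frac{R_1}{N} \cdot \frac{C_1}{N} = \frac{R_1 C_1}{N}
\]
```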

Maybe it would be easier for us to illustrate this with a simple example:

Figure 4.4: Computing Expected Frequencies

How do we compute the expected frequencies of the four cells?

Figure 4.5: Computing Expected Frequencies

example <- matrix(c(90, 10, 110, 290), byrow = TRUE, nrow = 2)

Exercise 4.8 Please compute the expected frequencies for the above matrix example in R.

4.8.3 Association Measures

The idea of lexical association is to measure how much the observed frequencies deviate from the expected frequencies. Some of the metrics (e.g., t-statistic, MI) consider only the deviation of the joint frequency (i.e., O11), while others (e.g., G2, a.k.a. Log-Likelihood Ratio) consider the deviations of ALL cells.

Here I would like to show you how we can compute the two most common association metrics for all the bigrams found in the corpus: the t-test statistic and Mutual Information (MI).

  • \(t = \frac{O_{11}-E_{11}}{\sqrt{O_{11}}}\)
  • \(MI = \log_2\frac{O_{11}}{E_{11}}\)
  • \(G^2 = 2 \sum_{ij}{O_{ij}\log\frac{O_{ij}}{E_{ij}}}\)

corp_us_bigrams_freq %>% head(10)

# sample size N: total number of bigram tokens, computed before any filtering
N_bigrams <- sum(corp_us_bigrams_freq$n)

corp_us_collocations <- corp_us_bigrams_freq %>%
  filter(n > 5) %>% # set bigram frequency cut-off
  rename(O11 = n) %>%
  tidyr::separate(bigram, c("w1", "w2"), sep="\\s") %>% # split bigrams into two columns
  mutate(R1 = corp_us_words_freq$n[match(w1, corp_us_words_freq$word)],
         C1 = corp_us_words_freq$n[match(w2, corp_us_words_freq$word)]) %>% # retrieve w1 w2 unigram freq
  mutate(E11 = (R1*C1)/N_bigrams) %>% # compute expected freq of bigrams
  mutate(MI = log2(O11/E11),
         t = (O11 - E11)/sqrt(O11)) %>% # compute associations
  arrange(desc(MI)) # sorting

corp_us_collocations

Please note that in the above example, we compute the lexical associations only for bigrams whose frequency > 5. This is necessary in collocation studies because bigrams of very low frequency would not be informative even though their association scores can be very high. However, the cut-off value can be arbitrary, depending on the corpus size or researchers’ considerations.
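
A small numeric sketch may make the need for a cut-off more concrete. The counts below are hypothetical, chosen only for illustration; N is the total number of bigram tokens reported earlier in this chapter, and the expected frequency is computed as R1 × C1 / N.

```r
N <- 135504 # total bigram tokens reported earlier in this chapter

# Hypothetical counts: a bigram seen once, whose two words each occur only once
rare <- list(O11 = 1, R1 = 1, C1 = 1)
# Hypothetical counts: a frequent bigram made of two frequent words
frequent <- list(O11 = 500, R1 = 2000, C1 = 3000)

# Helper (illustrative name): MI from observed and expected joint frequency
mi <- function(x, N) {
  E11 <- (x$R1 * x$C1) / N # expected frequency under independence
  log2(x$O11 / E11)
}

mi(rare, N)     # about 17.05
mi(frequent, N) # about 3.50
```

The one-off bigram receives a far higher MI than the well-attested one, although a single occurrence tells us very little. This is why MI rankings are usually combined with a frequency threshold.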

How to compute lexical associations is a non-trivial issue. There are many more ways to compute the association strength between two words. Please refer to Stefan Evert’s site for a very comprehensive review of lexical association measures.


Exercise 4.9 Sort the collocation data frame corp_us_collocations according to the t-score and compare the results sorted by MI scores. Please describe what you find.

Exercise 4.10 Based on the formula provided above, please create a new column for corp_us_collocations, which gives the Log-Likelihood Ratios of all the bigrams.

When you do the above exercise, you may run into a couple of problems:

  • Some of the bigrams have NaN values in their LLR. This may be due to the issue of NAs produced by integer overflow. Please solve this.
  • After solving the above overflow issue, you may still have a few bigrams with NaN in their LLR, which may be due to the computation of the log value. In Math, how do we define log(1/0) and log(0/1)? Do you know when you would get an undefined value NaN in the computation of log()?
  • To solve the problems, please assign the value 0 if the log returns NaN values.

Exercise 4.11

  1. Find the top FIVE bigrams ranked according to MI values for each president. The result would be a data frame as shown below.

  2. Create a plot as shown below to visualize your results.

4.9 Constructions

We are often interested in the use of linguistic patterns that go beyond word boundaries. In my experience, it is usually better to work with the corpus at the sentence level.

We can use the same tokenization function, unnest_tokens(), to convert our text-based corpus data frame, corp_us_tidy, into a sentence-based tidy structure:

corp_us_sents <- corp_us_tidy %>%
  unnest_tokens(output = sentence, 
                input = text, 
                token = "sentences") # tokenize the `text` column into `sentence`
corp_us_sents

With each sentence, we can investigate particular constructions in more detail. Let’s assume that we are interested in the use of the Perfect aspect in English by different presidents. We can try to extract Perfect constructions (including Present/Past Perfect) from each sentence using a regular expression.

Here we make a simple, naive assumption: Perfect constructions include all have/has/had + VERB-en/ed sequences in the sentences.

require(stringr)
# Perfect
corp_us_sents %>%
  unnest_tokens(perfect, 
                sentence,
                token = function(x) str_extract_all(x, "ha(d|ve|s) \\w+(en|ed)")) -> result_perfect
result_perfect

In the above example, we specify the token= argument in unnest_tokens(…, token = …) with a self-defined function. The idea of tokenization in unnest_tokens() is that the token argument can be a function which takes a text-based vector as input (i.e., each element of the input vector may be a document text) and returns a list, each element of which is a token-based version (i.e., vector) of the original input vector element (see Figure 4.6).

In our demonstration, we define a tokenization function, which takes sentence as the input and returns a list, each element of which consists of a vector of tokens matching the regular expression in the corresponding sentence. (Note: The function is anonymous; it is not assigned to an object name, so it is never created as an object in the R working session.)


Figure 4.6: Intuition for token= in unnest_tokens()
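
The input/output contract of such a tokenization function can be inspected outside unnest_tokens(). The sketch below applies a named version of the function to two invented, already lower-cased sentences; the names sents and tok_fun are illustrative assumptions.

```r
library(stringr)

# Two hypothetical, already lower-cased sentences
sents <- c("we have seen it and he has taken it.",
           "nothing to report here.")

# Parentheses group the alternatives had / have / has
tok_fun <- function(x) str_extract_all(x, "ha(d|ve|s) \\w+(en|ed)")

res <- tok_fun(sents)
res          # a list with one character vector per input sentence
lengths(res) # matches per sentence: 2 and 0

# Square brackets define a character class (exactly one character), so this
# variant can match "had" and "has" but never "have"
str_extract_all(sents[1], "ha[d|ve|s] \\w+(en|ed)")
```

The second sentence yields an empty vector, which is why sentences without any match contribute no rows to the output of unnest_tokens().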


And of course we can do an exploratory analysis of the frequencies of Perfect constructions by different presidents:

require(tidyr)
# table
result_perfect %>%
  group_by(President) %>%
  summarize(TOKEN_FREQ = n(),
            TYPE_FREQ = n_distinct(perfect))
# graph
result_perfect %>%
  group_by(President) %>%
  summarize(TOKEN_FREQ = n(),
            TYPE_FREQ = n_distinct(perfect)) %>%
  pivot_longer(c("TOKEN_FREQ", "TYPE_FREQ"), names_to = "STATISTIC", values_to = "NUMBER") %>%
  ggplot(aes(President, NUMBER, fill = STATISTIC)) +
           geom_bar(stat = "identity",position = position_dodge()) +
  theme(axis.text.x = element_text(angle=90))

There are quite a few things we need to take care of more thoroughly:

  1. The auxiliary HAVE and the past participle do not necessarily have to stand next to each other in Perfect constructions.

  2. We now lose track of one important piece of information: from which sentence of the presidential addresses did we collect each Perfect construction token?

Any ideas how to solve all these issues?

Exercise 4.12 Please create a better regular expression to retrieve more tokens of English Perfect constructions, where the auxiliary and the participle may not stand together.


Exercise 4.13 Re-generate a result_perfect data frame, where you can keep track of:

  • Which sentence of the address did the Perfect come from? (e.g., SENT_ID column below)
  • Which president’s address did the Perfect come from? (e.g., INDEX column below)

You may have a data frame as shown below.

Exercise 4.14 Re-create the above bar plot so that the type and token frequencies are computed for each address and the x axis is arranged accordingly (i.e., by year and president). Your resulting graph should look similar to the one below.

References

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.